Towards Structure-sensitive Hypertext Categorization

Authors

  • Alexander Mehler
  • Rüdiger Gleim
  • Matthias Dehmer
Abstract

Hypertext categorization is the task of automatically assigning category labels to hypertext units. Like text categorization, it belongs to the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.

Similar resources

Towards Logical Hypertext Structure A Graph-Theoretic Perspective

Facing the retrieval problem posed by the overwhelming set of documents online, the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized just as HTML tags and link structures. In spite of promising results this adaptation s...

Refined and Incremental Centroid-based approach for Genre Categorization of Web pages

In this paper, I propose a refined and incremental centroid-based approach for genre categorization of web pages. My approach is based on the construction of genre centroids using a set of training web pages. These centroids will be used to classify new web pages. The originality of my approach is the implementation of two new aspects, which are refining and incrementing. My approach is based o...

Classification Techniques for Categorization of Hypertext Documents

In this paper we investigate techniques for categorization of hypertext documents. Recent years have witnessed a growing interest in applying text categorization techniques to the Web. However, the semi-structured nature of the Web along with diverse subject matter present in it pose interesting challenges for conventional classification techniques. In this paper, we review some of the techniqu...

Hypertext Classification Using Tensor Space Model and Rough Set Based Ensemble Classifier

As the WWW grows at an increasing speed, classifiers targeted at hypertext have come into high demand. While document categorization is quite mature, the issue of utilizing hypertext structure and hyperlinks has been relatively unexplored. In this paper, we introduce tensor space model for representing hypertext documents. We exploit the local-structure and neighborhood recommendation encapsulate...

DHCS: A Case of Knowledge Share in Cooperative Computing Environment

Large-scale hypertext categorization has become one of the key techniques in web-based information acquisition. How to implement efficient hypertext categorization is still an ongoing research issue. This paper introduces the Distributed Hypertext Categorization System (DHCS), in which the Directed Acyclic Graph Support Vector Machines (DAGSVM) for learning multiclass hypertext classifiers is i...

Journal title:

Volume   Issue 

Pages  -

Publication date: 2005